Install any missing packages and load the required libraries.
if(!require(knitr)){
install.packages("knitr", repos = "https://cran.r-project.org")
library(knitr)
}
## Loading required package: knitr
if(!require(tidyverse)){
install.packages("tidyverse", repos = "https://cran.r-project.org")
library(tidyverse)
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require(ggforce)){
install.packages("ggforce", repos = "https://cran.r-project.org")
library(ggforce)
}
## Loading required package: ggforce
if(!require(ggplot2)){
install.packages("ggplot2", repos = "https://cran.r-project.org")
library(ggplot2)
}
if(!require(naniar)){
install.packages("naniar", repos = "https://cran.r-project.org")
library(naniar)
}
## Loading required package: naniar
if(!require(dplyr)){
install.packages("dplyr", repos = "https://cran.r-project.org")
library(dplyr)
}
if(!require(car)){
install.packages("car", repos = "https://cran.r-project.org")
library(car)
}
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
# read initial data
studentperformance.initial <- read.csv("StudentPerformanceFactors.csv")
head(studentperformance.initial, 3)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
## 1 No 7 73 Low
## 2 No 8 59 Low
## 3 Yes 7 91 Medium
## Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
## 1 Yes 0 Low Medium Public
## 2 Yes 2 Medium Medium Public
## 3 Yes 2 Medium Medium Public
## Peer_Influence Physical_Activity Learning_Disabilities
## 1 Positive 3 No
## 2 Negative 4 No
## 3 Neutral 4 No
## Parental_Education_Level Distance_from_Home Gender Exam_Score
## 1 High School Near Male 67
## 2 College Moderate Female 61
## 3 Postgraduate Near Male 74
# clean data
# check if there are any NA values
sum(is.na(studentperformance.initial))
## [1] 0
sum(complete.cases(studentperformance.initial))
## [1] 6607
# still, some values appeared blank, so define common missing-value strings
common_na_strings <- c("", " ", "NA", "N/A", "-", "null", "NULL")
# scan the entire data set for these strings
studentperformance.initial %>%
miss_scan_count(search = common_na_strings)
## # A tibble: 20 Ă— 2
## Variable n
## <chr> <int>
## 1 Hours_Studied 6607
## 2 Attendance 6607
## 3 Parental_Involvement 6607
## 4 Access_to_Resources 6607
## 5 Extracurricular_Activities 6607
## 6 Sleep_Hours 6607
## 7 Previous_Scores 6607
## 8 Motivation_Level 6607
## 9 Internet_Access 6607
## 10 Tutoring_Sessions 6607
## 11 Family_Income 6607
## 12 Teacher_Quality 6607
## 13 School_Type 6607
## 14 Peer_Influence 6607
## 15 Physical_Activity 6607
## 16 Learning_Disabilities 6607
## 17 Parental_Education_Level 6607
## 18 Distance_from_Home 6607
## 19 Gender 6607
## 20 Exam_Score 6607
# blanks confirmed -- replace blank strings with NA across all columns
studentperformance.na <- studentperformance.initial %>%
mutate(across(everything(), ~ ifelse(trimws(.) == "", NA, .)))
# tabulate how many values are now NA
table(is.na(studentperformance.na))
##
## FALSE TRUE
## 131905 235
# make sure the NA replacement worked by looking at new data frame
head(studentperformance.na, 3)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
## 1 No 7 73 Low
## 2 No 8 59 Low
## 3 Yes 7 91 Medium
## Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
## 1 Yes 0 Low Medium Public
## 2 Yes 2 Medium Medium Public
## 3 Yes 2 Medium Medium Public
## Peer_Influence Physical_Activity Learning_Disabilities
## 1 Positive 3 No
## 2 Negative 4 No
## 3 Neutral 4 No
## Parental_Education_Level Distance_from_Home Gender Exam_Score
## 1 High School Near Male 67
## 2 College Moderate Female 61
## 3 Postgraduate Near Male 74
#omit all NA values from the data
cleaned_data <- na.omit(studentperformance.na)
if (nrow(cleaned_data) == 0) {
stop("All rows contain missing values. Cannot proceed with model.")
}
#neatly organize and display cleaned data
str(cleaned_data)
## 'data.frame': 6378 obs. of 20 variables:
## $ Hours_Studied : int 23 19 24 29 19 19 29 25 17 23 ...
## $ Attendance : int 84 64 98 89 92 88 84 78 94 98 ...
## $ Parental_Involvement : chr "Low" "Low" "Medium" "Low" ...
## $ Access_to_Resources : chr "High" "Medium" "Medium" "Medium" ...
## $ Extracurricular_Activities: chr "No" "No" "Yes" "Yes" ...
## $ Sleep_Hours : int 7 8 7 8 6 8 7 6 6 8 ...
## $ Previous_Scores : int 73 59 91 98 65 89 68 50 80 71 ...
## $ Motivation_Level : chr "Low" "Low" "Medium" "Medium" ...
## $ Internet_Access : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Tutoring_Sessions : int 0 2 2 1 3 3 1 1 0 0 ...
## $ Family_Income : chr "Low" "Medium" "Medium" "Medium" ...
## $ Teacher_Quality : chr "Medium" "Medium" "Medium" "Medium" ...
## $ School_Type : chr "Public" "Public" "Public" "Public" ...
## $ Peer_Influence : chr "Positive" "Negative" "Neutral" "Negative" ...
## $ Physical_Activity : int 3 4 4 4 4 3 2 2 1 5 ...
## $ Learning_Disabilities : chr "No" "No" "No" "No" ...
## $ Parental_Education_Level : chr "High School" "College" "Postgraduate" "High School" ...
## $ Distance_from_Home : chr "Near" "Moderate" "Near" "Moderate" ...
## $ Gender : chr "Male" "Female" "Male" "Male" ...
## $ Exam_Score : int 67 61 74 71 70 71 67 66 69 72 ...
## - attr(*, "na.action")= 'omit' Named int [1:229] 34 128 241 276 317 360 381 397 403 409 ...
## ..- attr(*, "names")= chr [1:229] "34" "128" "241" "276" ...
head(cleaned_data, 3)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
## 1 No 7 73 Low
## 2 No 8 59 Low
## 3 Yes 7 91 Medium
## Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
## 1 Yes 0 Low Medium Public
## 2 Yes 2 Medium Medium Public
## 3 Yes 2 Medium Medium Public
## Peer_Influence Physical_Activity Learning_Disabilities
## 1 Positive 3 No
## 2 Negative 4 No
## 3 Neutral 4 No
## Parental_Education_Level Distance_from_Home Gender Exam_Score
## 1 High School Near Male 67
## 2 College Moderate Female 61
## 3 Postgraduate Near Male 74
# convert all character (categorical) variables to factors so R treats them as categories when analyzing; integer variables are already the right type, so they are left unchanged
cleaned_data[] <- lapply(cleaned_data, function(x) {
if (is.character(x))
as.factor(x)
else
x
})
sapply(cleaned_data, class)
## Hours_Studied Attendance
## "integer" "integer"
## Parental_Involvement Access_to_Resources
## "factor" "factor"
## Extracurricular_Activities Sleep_Hours
## "factor" "integer"
## Previous_Scores Motivation_Level
## "integer" "factor"
## Internet_Access Tutoring_Sessions
## "factor" "integer"
## Family_Income Teacher_Quality
## "factor" "factor"
## School_Type Peer_Influence
## "factor" "factor"
## Physical_Activity Learning_Disabilities
## "integer" "factor"
## Parental_Education_Level Distance_from_Home
## "factor" "factor"
## Gender Exam_Score
## "factor" "integer"
studentperformance <- cleaned_data
head(studentperformance, 3)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
## 1 No 7 73 Low
## 2 No 8 59 Low
## 3 Yes 7 91 Medium
## Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
## 1 Yes 0 Low Medium Public
## 2 Yes 2 Medium Medium Public
## 3 Yes 2 Medium Medium Public
## Peer_Influence Physical_Activity Learning_Disabilities
## 1 Positive 3 No
## 2 Negative 4 No
## 3 Neutral 4 No
## Parental_Education_Level Distance_from_Home Gender Exam_Score
## 1 High School Near Male 67
## 2 College Moderate Female 61
## 3 Postgraduate Near Male 74
# create new column for student progress
studentperformance$Progress <- studentperformance$Exam_Score - studentperformance$Previous_Scores
head(studentperformance, 3)
## Hours_Studied Attendance Parental_Involvement Access_to_Resources
## 1 23 84 Low High
## 2 19 64 Low Medium
## 3 24 98 Medium Medium
## Extracurricular_Activities Sleep_Hours Previous_Scores Motivation_Level
## 1 No 7 73 Low
## 2 No 8 59 Low
## 3 Yes 7 91 Medium
## Internet_Access Tutoring_Sessions Family_Income Teacher_Quality School_Type
## 1 Yes 0 Low Medium Public
## 2 Yes 2 Medium Medium Public
## 3 Yes 2 Medium Medium Public
## Peer_Influence Physical_Activity Learning_Disabilities
## 1 Positive 3 No
## 2 Negative 4 No
## 3 Neutral 4 No
## Parental_Education_Level Distance_from_Home Gender Exam_Score Progress
## 1 High School Near Male 67 -6
## 2 College Moderate Female 61 2
## 3 Postgraduate Near Male 74 -17
# check the class of Hours_Studied and Tutoring_Sessions
class(studentperformance$Hours_Studied)
## [1] "integer"
class(studentperformance$Tutoring_Sessions)
## [1] "integer"
#since these are integer variables: calculate the mean, median, mode and frequency distribution
# list of interest variables to calculate statistics for
integer.interest.variables <- c("Hours_Studied", "Tutoring_Sessions")
#loop through each integer variable and calculate the mean
for (var in integer.interest.variables) {
mean_value <- mean(studentperformance[[var]])
# print the results
cat("Mean for", var, ":", mean_value, "\n")
}
## Mean for Hours_Studied : 19.97711
## Mean for Tutoring_Sessions : 1.495296
#loop through each interest variable and calculate the median
for (var in integer.interest.variables) {
median_value <- median(studentperformance[[var]])
# print the results
cat("Median for", var, ":", median_value, "\n")
}
## Median for Hours_Studied : 20
## Median for Tutoring_Sessions : 1
# calculate mode
# define a helper for the statistical mode; named get_mode so it does not mask base::mode()
get_mode <- function(x) {
uniq_x <- unique(x)
uniq_x[which.max(tabulate(match(x, uniq_x)))] # return the most frequent value
}
#loop through each interest variable and calculate the mode
for (var in integer.interest.variables) {
mode_value <- get_mode(studentperformance[[var]])
# print the results
cat("Mode for", var, ":", mode_value, "\n")
}
## Mode for Hours_Studied : 20
## Mode for Tutoring_Sessions : 1
#calculate frequency distribution
#loop through each interest variable and calculate the frequency distribution
for (var in integer.interest.variables) {
freq_table <- table(studentperformance[[var]])
# print the frequency table
cat("Frequency distribution for", var, ":\n")
print(freq_table)
cat("---------------------------------\n")
}
## Frequency distribution for Hours_Studied :
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 3 4 12 16 21 17 51 55 79 91 140 189 213 257 303 337 370 393 425 448
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 43
## 420 386 389 343 281 255 217 169 131 119 75 54 37 27 20 10 6 7 6 1
## 44
## 1
## ---------------------------------
## Frequency distribution for Tutoring_Sessions :
##
## 0 1 2 3 4 5 6 7 8
## 1458 2111 1586 800 296 101 18 7 1
## ---------------------------------
#create histograms
#loop through each interest variable and plot its distribution (hist() draws as a side effect)
for (var in integer.interest.variables) {
hist(studentperformance[[var]], main = var, xlab = var)
}
Looking at our integer variables of interest, the average student in the survey studies about 20 hours and attends about 1.5 tutoring sessions. The median and mode tell the same story: roughly 20 study hours and 1 tutoring session for both measures. In other words, before splitting by any other variable, most students cluster around these values. The frequency distributions show that most students in the data set study between 11 and 30 hours, and most receive 0-2 tutoring sessions, with a smaller group at 3-5.
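The per-variable loops above can also be collapsed into a single `summarize()` call. The sketch below is illustrative only: the values and the `stat_mode` helper are made up here, not taken from the dataset (the helper computes the statistical mode and is named so it does not mask base R's `mode()`).

```r
library(dplyr)

# helper for the statistical mode (most frequent value)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# illustrative toy values standing in for a column like Hours_Studied
toy <- data.frame(hours = c(18, 20, 20, 21, 22, 20, 19))

# all three summaries in one call
res <- toy %>%
  summarize(
    Mean   = mean(hours),
    Median = median(hours),
    Mode   = stat_mode(hours)
  )
res
```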
# barplot visualization of variables of interest: Extracurricular_Activities, Access_to_Resources, Hours_Studied, Parental_Involvement, Tutoring_Sessions, School_Type, Parental_Education_Level
#create the interest variables that we need to graph
interest.variables <- c("Extracurricular_Activities", "Access_to_Resources", "Hours_Studied",
"Parental_Involvement", "Tutoring_Sessions", "School_Type",
"Parental_Education_Level")
# loop through each variable and create a bar chart
for (var in interest.variables) {
barplot <- ggplot(studentperformance, aes(x = !!sym(var), fill = Family_Income)) +
geom_bar(position = "dodge") +
labs(title = paste("Distribution of", var, "by Family Income"),
x = var,
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# print the plot
print(barplot)
}
Visualizing these seven interest variables, each a possible indicator of educational progress and each split by family income, produces some surprising results. Our original idea that higher income leads to more access to educational resources does not hold up. In fact, students from higher-income backgrounds appear to participate in fewer extracurricular activities than their low- and medium-income peers. The same trend runs through the other variables: high-income students do not dominate access to resources, they spend less time studying, have less parental involvement, attend fewer tutoring sessions, are enrolled at lower rates in both public and private schools, and their parents' education levels make up smaller proportions than those of low- and middle-income families. (These are raw counts, so the smaller high-income bars partly reflect the smaller number of high-income students surveyed.)
#mosaic chart visualization of our (5) categorical variables of interest: Extracurricular_Activities, Access_to_Resources, Parental_Involvement, School_Type, Parental_Education_Level
# create a new set of variables for just categorical ones we are testing
cat.interest.variables <- c("Extracurricular_Activities", "Access_to_Resources",
"Parental_Involvement", "School_Type", "Parental_Education_Level")
#install package to be able to plot a mosaic plot
if(!require(ggmosaic)){
install.packages("ggmosaic", repos = "https://cran.r-project.org")
library(ggmosaic)
}
## Loading required package: ggmosaic
# convert Family_Income and each categorical interest variable to a factor
studentperformance$Family_Income <- as.factor(studentperformance$Family_Income)
for (var in cat.interest.variables) {
studentperformance[[var]] <- as.factor(studentperformance[[var]])
}
# loop through each variable and create a mosaic plot
for (var in cat.interest.variables) {
mosaic.plot <- ggplot(studentperformance) +
geom_mosaic(aes(weight = 1, x = product(Family_Income), fill = !!sym(var)), na.rm = TRUE) +
labs(title = paste("Mosaic Plot of", var, "by Family Income"),
x = "Family Income",
y = "Proportion",
fill = var) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# print the plot
print(mosaic.plot)
}
## Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2
## 3.5.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
## Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
For our five categorical variables of interest, we chose mosaic plots to represent the data. Family income does not appear to have a large effect on the proportions: the only income level with noticeably different proportions is the low-income group, with most differences in access to resources, parental involvement, and school type. The wider low-income columns also suggest that low-income families make up the largest share of the sample. Overall, these plots indicate no substantial proportional difference in the education-related variables between income groups.
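The guess about which income group dominates the sample can be checked directly with counts and proportions per group. This is an illustrative sketch on toy data, not the actual dataset:

```r
library(dplyr)

# toy stand-in for the Family_Income column (counts are illustrative)
toy <- data.frame(
  Family_Income = factor(c("Low", "Low", "Low", "Medium", "Medium", "High"))
)

# group sizes and their share of the sample
res <- toy %>%
  count(Family_Income) %>%
  mutate(prop = n / sum(n))
res
```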
#boxplot chart visualization of our (2) integer variables of interest: Hours_Studied and Tutoring Sessions
#identify variables of interest
integer.interest.variables
## [1] "Hours_Studied" "Tutoring_Sessions"
x_variable <- "Family_Income"
# loop through each variable of interest to create boxplots
for (var in integer.interest.variables) {
boxplot <- ggplot(studentperformance, aes(x = .data[[x_variable]], y = .data[[var]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", outlier.color = "red") +
theme_minimal() +
labs(title = paste(var, "vs", x_variable),
x = x_variable,
y = var) +
theme(plot.title = element_text(hjust = 0.5, size = 14),
axis.title = element_text(size = 12))
# print the plot
print(boxplot)
}
Here we used boxplots for our two integer variables of interest to get a visual representation of their distributions across family income. For hours studied, the interquartile ranges (25th to 75th percentiles) look essentially identical across income groups, but the low- and medium-income groups show many more outliers than the high-income group, indicating a wider overall spread for those two groups. For tutoring sessions, the interquartile ranges again look similar, but the outliers in the medium-income group suggest its spread differs from the low- and high-income groups.
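The spread claims read off the boxplots can be quantified with per-group quartiles and the usual 1.5 × IQR outlier rule. A sketch on made-up toy values, not the student data:

```r
library(dplyr)

# toy stand-in for Hours_Studied split by income (values are illustrative)
toy <- data.frame(
  income = rep(c("Low", "High"), each = 6),
  hours  = c(5, 18, 19, 20, 21, 40,    # wide spread, two extreme values
             18, 19, 20, 20, 21, 22)   # tight spread
)

# per-group IQR and a 1.5 * IQR outlier count, mirroring what the boxplots show
res <- toy %>%
  group_by(income) %>%
  summarize(
    q1       = quantile(hours, 0.25),
    q3       = quantile(hours, 0.75),
    iqr      = q3 - q1,
    outliers = sum(hours < q1 - 1.5 * iqr | hours > q3 + 1.5 * iqr)
  )
res
```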
#check for assumptions for chi-square
# loop through each variable to check assumptions
for (var in cat.interest.variables) {
# Create a contingency table
contingency_table <- table(studentperformance[[var]], studentperformance$Family_Income)
# Check independence assumption (cannot directly test but mention it's assumed)
cat("Checking assumptions for", var, "vs Family_Income:\n")
cat("Independence of observations: Assumed based on data collection.\n")
# Calculate expected frequencies
expected_freq <- chisq.test(contingency_table)$expected
# Check if all expected frequencies are >= 5
if (all(expected_freq >= 5)) {
cat("All expected frequencies are >= 5. The Chi-Square test assumption is met.\n")
} else {
cat("Warning: Some expected frequencies are < 5. The Chi-Square test assumption is violated.\n")
cat("Expected frequencies:\n")
print(expected_freq)
}
cat("\n-------------------------------------------\n")
}
## Checking assumptions for Extracurricular_Activities vs Family_Income:
## Independence of observations: Assumed based on data collection.
## All expected frequencies are >= 5. The Chi-Square test assumption is met.
##
## -------------------------------------------
## Checking assumptions for Access_to_Resources vs Family_Income:
## Independence of observations: Assumed based on data collection.
## All expected frequencies are >= 5. The Chi-Square test assumption is met.
##
## -------------------------------------------
## Checking assumptions for Parental_Involvement vs Family_Income:
## Independence of observations: Assumed based on data collection.
## All expected frequencies are >= 5. The Chi-Square test assumption is met.
##
## -------------------------------------------
## Checking assumptions for School_Type vs Family_Income:
## Independence of observations: Assumed based on data collection.
## All expected frequencies are >= 5. The Chi-Square test assumption is met.
##
## -------------------------------------------
## Checking assumptions for Parental_Education_Level vs Family_Income:
## Independence of observations: Assumed based on data collection.
## All expected frequencies are >= 5. The Chi-Square test assumption is met.
##
## -------------------------------------------
All Chi-Square test assumptions are met for the categorical interest variables, so we can proceed with the tests.
#chi-square analysis of our categorical variables of interest: Extracurricular_Activities, Access_to_Resources, Parental_Involvement, School_Type, Parental_Educational_Level
# loop through each variable and run the Chi-Square test
for (var in cat.interest.variables) {
contingency_table <- table(studentperformance[[var]], studentperformance$Family_Income)
chi_test <- chisq.test(contingency_table)
# display the results
cat("Chi-Square Test for", var, "vs Family_Income:\n")
print(chi_test)
cat("\n-------------------------------------------\n")
}
## Chi-Square Test for Extracurricular_Activities vs Family_Income:
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.87651, df = 2, p-value = 0.6452
##
##
## -------------------------------------------
## Chi-Square Test for Access_to_Resources vs Family_Income:
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 4.0107, df = 4, p-value = 0.4046
##
##
## -------------------------------------------
## Chi-Square Test for Parental_Involvement vs Family_Income:
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 4.4075, df = 4, p-value = 0.3537
##
##
## -------------------------------------------
## Chi-Square Test for School_Type vs Family_Income:
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.91345, df = 2, p-value = 0.6334
##
##
## -------------------------------------------
## Chi-Square Test for Parental_Education_Level vs Family_Income:
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 0.98463, df = 4, p-value = 0.9121
##
##
## -------------------------------------------
For all categorical interest variables, the chi-square p-values are larger than 0.05, so we find no significant evidence of an association between these variables and family income. This matches the visualizations above: the surveyed groups differ little across income levels in the variables that affect their education. It follows that if there is some difference in progress across family income, it is unlikely to be driven by these categorical variables.
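Had any of these tests come out significant, the standardized residuals stored in the `chisq.test` result would show which cells drive the association. A sketch with an invented contingency table (the counts are not from this dataset):

```r
# toy contingency table (counts are illustrative, not from the dataset)
tab <- matrix(c(30, 10,
                10, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(Activity = c("Yes", "No"),
                              Income   = c("Low", "High")))

chi <- chisq.test(tab)
chi$p.value  # overall association
chi$stdres   # cells with |standardized residual| > 2 drive the association
```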
Since no single categorical variable shows a strong association with family income, we will instead test each one individually against progress using ANOVA, with progress as the dependent variable and each interest variable as the independent variable. Then we will use simple linear regression.
# The conditions for ANOVA are independence, normality, and homoskedasticity. We will assume that there is independence between and within family income groups, because it's not likely that one family's income is impacted by another family's income.
# mean, standard deviation, number of observations, and progress for each family income group
studentperformance %>%
group_by(Family_Income) %>%
summarize(
Mean = mean(Progress),
SD = sd(Progress),
n = n(),
SE = sd(Progress) / sqrt(n())
)
## # A tibble: 3 Ă— 5
## Family_Income Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 High -6.64 14.3 1230 0.408
## 2 Low -8.37 14.3 2582 0.282
## 3 Medium -7.82 14.1 2566 0.279
# histogram
studentperformance %>%
ggplot(aes(x = Progress)) +
geom_histogram(aes(fill = Family_Income)) +
facet_grid(cols = vars(Family_Income))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# boxplot
studentperformance %>%
ggplot(aes(x = Family_Income, y = Progress)) +
geom_boxplot(aes(fill = Family_Income)) +
labs(
x = "Family Income",
y = "Progress"
)
Although the boxplots show one outlier for the medium-income group, we will not remove it: it is not extreme enough to change the variability of the data drastically, and it is unlikely to be caused by a measurement error. Overall, the data in every income group look nearly normal, and the variability does not differ noticeably across groups, so the conditions for an ANOVA test are met.
# Fit the model
family_income_fit <- lm(Progress ~ Family_Income, data = studentperformance)
# Perform ANOVA
family_income_anova <- anova(family_income_fit)
cat("\nANOVA Results for Family_Income:\n")
##
## ANOVA Results for Family_Income:
print(family_income_anova)
## Analysis of Variance Table
##
## Response: Progress
## Df Sum Sq Mean Sq F value Pr(>F)
## Family_Income 2 2500 1250.19 6.1672 0.00211 **
## Residuals 6375 1292317 202.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Check the p-value from ANOVA
p_value <- family_income_anova$`Pr(>F)`[1]
if (p_value < 0.05) {
cat("\nSignificant effect detected (p < 0.05). Proceeding with post hoc tests...\n")
# Residual diagnostics: Normality test
ks_test <- ks.test(residuals(family_income_fit), "pnorm",
mean = mean(residuals(family_income_fit)),
sd = sd(residuals(family_income_fit)))
cat("\nKolmogorov-Smirnov test for normality of residuals:\n")
print(ks_test)
# Homogeneity of variances: Levene's test
levene_test <- car::leveneTest(Progress ~ Family_Income, data = studentperformance)
cat("\nLevene's test for homogeneity of variances:\n")
print(levene_test)
# Post hoc tests
if (levene_test$`Pr(>F)`[1] > 0.05) {
# Variances are equal: Use Tukey's HSD
posthoc_result <- TukeyHSD(aov(family_income_fit))
cat("\nTukeyHSD Post-Hoc Test Results:\n")
print(posthoc_result)
} else {
# Variances are not equal: Use Pairwise t-tests with p-value adjustment
posthoc_result <- pairwise.t.test(
studentperformance$Progress,
studentperformance$Family_Income,
p.adjust.method = "holm"
)
cat("\nPairwise t-test with Holm correction (unequal variances):\n")
print(posthoc_result)
}
} else {
cat("\nNo significant effect detected (p >= 0.05). Post hoc tests are not necessary.\n")
}
##
## Significant effect detected (p < 0.05). Proceeding with post hoc tests...
## Warning in ks.test.default(residuals(family_income_fit), "pnorm", mean =
## mean(residuals(family_income_fit)), : ties should not be present for the
## one-sample Kolmogorov-Smirnov test
##
## Kolmogorov-Smirnov test for normality of residuals:
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: residuals(family_income_fit)
## D = 0.05898, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
##
## Levene's test for homogeneity of variances:
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.1417 0.3193
## 6375
##
## TukeyHSD Post-Hoc Test Results:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = family_income_fit)
##
## $Family_Income
## diff lwr upr p adj
## Low-High -1.7323931 -2.8887556 -0.57603060 0.0013013
## Medium-High -1.1774214 -2.3349466 -0.01989624 0.0451180
## Medium-Low 0.5549717 -0.3754098 1.48535316 0.3418028
The Tukey results show significant differences in exam-score progress between the high-income group and each of the low- and medium-income groups (adjusted p = 0.0013 and 0.045), but not between the low- and medium-income groups.
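Significant pairs can also be pulled out of a `TukeyHSD` result programmatically rather than read off the printout. A sketch on simulated toy groups (not the student data), where group C is deliberately shifted upward:

```r
# toy one-way design: group C shifted upward (values are illustrative)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  y     = c(rnorm(20, mean = 0), rnorm(20, mean = 0), rnorm(20, mean = 3))
)

tk  <- TukeyHSD(aov(y ~ group, data = toy))
res <- tk$group                                   # matrix: diff, lwr, upr, p adj
sig <- res[res[, "p adj"] < 0.05, , drop = FALSE] # keep only significant pairs
sig
```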
for (var in cat.interest.variables){
studentperformance %>%
group_by(!!sym(var)) %>% # sym() makes a string into a symbol, !! unquotes
summarize(
Mean = mean(Progress),
SD = sd(Progress),
n = n(),
SE = sd(Progress) / sqrt(n())
) %>%
print()
hist <- studentperformance %>%
ggplot(aes(x = Progress)) +
geom_histogram(aes(fill = !!sym(var))) +
facet_grid(cols = vars(!!sym(var)))
print(hist)
bp <- studentperformance %>%
ggplot(aes(x = !!sym(var), y = Progress)) + # !!sym(var) maps the column itself, not the literal string "var"
geom_boxplot(aes(fill = !!sym(var))) +
labs(
x = var,
y = "Progress"
)
print(bp)
}
## # A tibble: 2 Ă— 5
## Extracurricular_Activities Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 No -8.07 14.2 2571 0.280
## 2 Yes -7.64 14.3 3807 0.232
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 3 Ă— 5
## Access_to_Resources Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 High -7.28 14.2 1900 0.326
## 2 Low -8.10 14.0 1274 0.391
## 3 Medium -8.01 14.4 3204 0.254
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 3 Ă— 5
## Parental_Involvement Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 High -6.66 14.1 1836 0.329
## 2 Low -9.31 14.3 1291 0.399
## 3 Medium -7.87 14.2 3251 0.250
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 2 Ă— 5
## School_Type Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 Private -7.48 14.2 1944 0.323
## 2 Public -7.96 14.2 4434 0.214
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## # A tibble: 3 Ă— 5
## Parental_Education_Level Mean SD n SE
## <fct> <dbl> <dbl> <int> <dbl>
## 1 College -7.24 14.1 1939 0.321
## 2 High School -8.45 14.3 3159 0.255
## 3 Postgraduate -7.1 14.2 1280 0.397
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
All variables show outliers on their boxplots, but we keep them for the analysis since none are extreme enough to change the variability drastically.
library(car) # For Levene's Test
anova_valid_cat_var <- c("Extracurricular_Activities", "Access_to_Resources",
"Parental_Involvement", "School_Type")
dependent_var <- "Progress"
# Initialize vectors to store results
p_values <- numeric(length(anova_valid_cat_var))
names(p_values) <- anova_valid_cat_var
# Store significant variables for post hoc testing
significant_vars <- list()
# Perform ANOVA for each variable
for (i in seq_along(anova_valid_cat_var)) {
var <- anova_valid_cat_var[i]
formula <- as.formula(paste(dependent_var, "~", var))
var_fit <- lm(formula, data = studentperformance)
var_anova <- anova(var_fit)
# Print ANOVA results
cat("\nANOVA Results for:", var, "\n")
print(var_anova)
# Extract p-value
p_values[i] <- var_anova$`Pr(>F)`[1]
# Store the variable if p-value < 0.05
if (p_values[i] < 0.05) {
significant_vars[[var]] <- var_fit
}
}
##
## ANOVA Results for: Extracurricular_Activities
## Analysis of Variance Table
##
## Response: Progress
## Df Sum Sq Mean Sq F value Pr(>F)
## Extracurricular_Activities 1 277 277.08 1.3647 0.2428
## Residuals 6376 1294540 203.03
##
## ANOVA Results for: Access_to_Resources
## Analysis of Variance Table
##
## Response: Progress
## Df Sum Sq Mean Sq F value Pr(>F)
## Access_to_Resources 2 766 382.76 1.8856 0.1518
## Residuals 6375 1294052 202.99
##
## ANOVA Results for: Parental_Involvement
## Analysis of Variance Table
##
## Response: Progress
## Df Sum Sq Mean Sq F value Pr(>F)
## Parental_Involvement 2 5347 2673.65 13.218 1.868e-06 ***
## Residuals 6375 1289470 202.27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## ANOVA Results for: School_Type
## Analysis of Variance Table
##
## Response: Progress
## Df Sum Sq Mean Sq F value Pr(>F)
## School_Type 1 308 308.32 1.5186 0.2179
## Residuals 6376 1294509 203.03
# Output all p-values
cat("\nP-values from ANOVA:\n")
##
## P-values from ANOVA:
print(p_values)
## Extracurricular_Activities Access_to_Resources
## 2.427687e-01 1.518173e-01
## Parental_Involvement School_Type
## 1.867522e-06 2.178779e-01
# Perform post hoc tests for significant variables
if (length(significant_vars) > 0) {
for (var in names(significant_vars)) {
cat("\nPerforming post hoc tests for:", var, "\n")
var_fit <- significant_vars[[var]]
formula <- as.formula(paste(dependent_var, "~", var))
# Check residual normality
ks_test <- ks.test(residuals(var_fit), "pnorm",
mean = mean(residuals(var_fit)),
sd = sd(residuals(var_fit)))
cat("Kolmogorov-Smirnov test for normality:\n")
print(ks_test)
# Check homogeneity of variances
levene_test <- car::leveneTest(formula, data = studentperformance)
cat("Levene's test for homogeneity of variances:\n")
print(levene_test)
if (levene_test$`Pr(>F)`[1] > 0.05) {
# Variances are equal: Use Tukey's HSD
posthoc_result <- TukeyHSD(aov(var_fit))
cat("\nTukeyHSD Post-Hoc Test:\n")
print(posthoc_result)
} else {
# Variances are not equal: Use Pairwise t-tests with corrections
posthoc_result <- pairwise.t.test(
studentperformance[[dependent_var]],
studentperformance[[var]],
p.adjust.method = "holm"
)
cat("\nPairwise t-test with Holm correction:\n")
print(posthoc_result)
}
}
} else {
cat("\nNo variables with significant ANOVA results (p < 0.05).\n")
}
##
## Performing post hoc tests for: Parental_Involvement
## Warning in ks.test.default(residuals(var_fit), "pnorm", mean =
## mean(residuals(var_fit)), : ties should not be present for the one-sample
## Kolmogorov-Smirnov test
## Kolmogorov-Smirnov test for normality:
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: residuals(var_fit)
## D = 0.05957, p-value < 2.2e-16
## alternative hypothesis: two-sided
##
## Levene's test for homogeneity of variances:
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.3778 0.6854
## 6375
##
## TukeyHSD Post-Hoc Test:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = var_fit)
##
## $Parental_Involvement
## diff lwr upr p adj
## Low-High -2.650711 -3.8616824 -1.4397399 0.0000009
## Medium-High -1.210134 -2.1834530 -0.2368142 0.0099975
## Medium-Low 1.440578 0.3437931 2.5373620 0.0059105
Only Parental_Involvement showed a significant difference in Progress between its groups; the Tukey output indicates that all three pairwise differences (Low–High, Medium–High, Medium–Low) are significant at the 5% level.
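The Tukey intervals above can also be visualized; base R provides a `plot` method for `TukeyHSD` objects. A sketch, refitting the same one-way model on the `studentperformance` data used throughout:

```r
# 95% family-wise confidence intervals for the pairwise group differences
tukey_pi <- TukeyHSD(aov(Progress ~ Parental_Involvement,
                         data = studentperformance))
plot(tukey_pi, las = 1)
```

Intervals that do not cross zero correspond to the significant pairwise differences reported above.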
model_hours <- lm(Progress ~ Hours_Studied, data = studentperformance)
# scatter plots
for (var in integer.interest.variables){
scatter <- ggplot(data = studentperformance) +
geom_point(aes(x = !!sym(var), y = Progress))
print(scatter)
}
ggplot(data = studentperformance) +
geom_point(aes(x = Hours_Studied, y = Progress)) +
geom_line(
data = fortify(model_hours),
aes(x = Hours_Studied, y = .fitted),
color = "red"
)
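An equivalent way to overlay the fitted line, without going through `fortify()`, is ggplot2's `geom_smooth()` with `method = "lm"`:

```r
# Same scatter plot with the least-squares line drawn by ggplot2 itself
ggplot(studentperformance, aes(x = Hours_Studied, y = Progress)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")
```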
summary(model_hours)
##
## Call:
## lm(formula = Progress ~ Hours_Studied, data = studentperformance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.139 -12.007 -0.007 11.861 49.283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.54792 0.61867 -20.282 < 2e-16 ***
## Hours_Studied 0.23696 0.02967 7.988 1.62e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.18 on 6376 degrees of freedom
## Multiple R-squared: 0.009908, Adjusted R-squared: 0.009752
## F-statistic: 63.8 on 1 and 6376 DF, p-value: 1.619e-15
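A 95% confidence interval for the slope makes the estimated effect size explicit — from the summary above it should come out to roughly 0.18 to 0.30 extra points of progress per additional hour studied:

```r
# 95% confidence intervals for the intercept and the Hours_Studied slope
confint(model_hours, level = 0.95)
```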
# Simple Linear Regression Conditions Check for Final Model
# 1. Residuals vs. Fitted Values (Linearity and Homoscedasticity)
plot(model_hours, which = 1)
# 2. Normal Q-Q Plot (Normality of Residuals)
plot(model_hours, which = 2)
# 3. Scale-Location Plot (Homoscedasticity)
plot(model_hours, which = 3)
# 4. Residuals vs Leverage Plot (Influential Points)
plot(model_hours, which = 5)
Looking at the scatter plots, there does not appear to be a linear trend between the number of tutoring sessions and progress in exam scores. However, there is a slight linear trend between hours studied and progress, so we fit a simple linear regression for this variable only. The slope is highly significant (p ≈ 1.6e-15), although the model explains very little of the variance (R² ≈ 0.01); the best-fit line is shown in red on the last scatter plot.
Looking through the data and correlations, there appears to be a relationship between a student's educational progress and their family income. The seven variables we decided to test were Extracurricular Activities, Access to Resources, Hours Studied, Parental Involvement, Tutoring Sessions, School Type, and Parental Educational Level. We chose these because we expected them to be correlated with family income and, through it, with educational progress.
However, only two of these variables showed a relationship with educational progress: Parental Involvement and Hours Studied. We therefore concluded that these two variables can affect a student's educational progress even when the data are not split by family income. We also concluded that there is a relationship between family income and a student's progress, but we could not identify exactly which variable drives that relationship (it was not one of our chosen variables). There may be other variables in the data set, which we did not test, that are affected by family income and in turn affect the student's progress.
If we did this project again, we would test all the variables rather than only those we previously believed could be affected by family income. Some variable in the data set does seem to be affected by family income and to affect educational progress in turn, and testing all the variables might identify it.
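Such a screen could be sketched by regressing Progress on every numeric column and collecting the slope p-values. This is only a first-pass filter (it ignores multiple testing and categorical predictors), and the `sapply(..., is.numeric)` column selection assumes the cleaned `studentperformance` data frame used in this report:

```r
# Fit Progress ~ x for every numeric predictor and collect slope p-values
numeric_vars <- setdiff(
  names(studentperformance)[sapply(studentperformance, is.numeric)],
  "Progress"
)
screen_p <- sapply(numeric_vars, function(v) {
  fit <- lm(as.formula(paste("Progress ~", v)), data = studentperformance)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})
sort(screen_p)  # smallest p-values first
```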